Titanic Survival Analysis:

First steps:

First, we import the libraries needed for the analysis and load the data file:


In [1]:
# Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy

In [2]:
# Read the csv file
titanic = pd.read_csv("titanic-data.csv")

The next step is to explore the dataset:


In [3]:
titanic.shape


Out[3]:
(891, 12)

In [4]:
titanic.columns


Out[4]:
Index([u'PassengerId', u'Survived', u'Pclass', u'Name', u'Sex', u'Age',
       u'SibSp', u'Parch', u'Ticket', u'Fare', u'Cabin', u'Embarked'],
      dtype='object')

In [5]:
titanic


Out[5]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
... ... ... ... ... ... ... ... ... ... ... ... ...
861 862 0 2 Giles, Mr. Frederick Edward male 21.0 1 0 28134 11.5000 NaN S
862 863 1 1 Swift, Mrs. Frederick Joel (Margaret Welles Ba... female 48.0 0 0 17466 25.9292 D17 S
863 864 0 3 Sage, Miss. Dorothy Edith "Dolly" female NaN 8 2 CA. 2343 69.5500 NaN S
864 865 0 2 Gill, Mr. John William male 24.0 0 0 233866 13.0000 NaN S
865 866 1 2 Bystrom, Mrs. (Karolina) female 42.0 0 0 236852 13.0000 NaN S
866 867 1 2 Duran y More, Miss. Asuncion female 27.0 1 0 SC/PARIS 2149 13.8583 NaN C
867 868 0 1 Roebling, Mr. Washington Augustus II male 31.0 0 0 PC 17590 50.4958 A24 S
868 869 0 3 van Melkebeke, Mr. Philemon male NaN 0 0 345777 9.5000 NaN S
869 870 1 3 Johnson, Master. Harold Theodor male 4.0 1 1 347742 11.1333 NaN S
870 871 0 3 Balkic, Mr. Cerin male 26.0 0 0 349248 7.8958 NaN S
871 872 1 1 Beckwith, Mrs. Richard Leonard (Sallie Monypeny) female 47.0 1 1 11751 52.5542 D35 S
872 873 0 1 Carlsson, Mr. Frans Olof male 33.0 0 0 695 5.0000 B51 B53 B55 S
873 874 0 3 Vander Cruyssen, Mr. Victor male 47.0 0 0 345765 9.0000 NaN S
874 875 1 2 Abelson, Mrs. Samuel (Hannah Wizosky) female 28.0 1 0 P/PP 3381 24.0000 NaN C
875 876 1 3 Najib, Miss. Adele Kiamie "Jane" female 15.0 0 0 2667 7.2250 NaN C
876 877 0 3 Gustafsson, Mr. Alfred Ossian male 20.0 0 0 7534 9.8458 NaN S
877 878 0 3 Petroff, Mr. Nedelio male 19.0 0 0 349212 7.8958 NaN S
878 879 0 3 Laleff, Mr. Kristo male NaN 0 0 349217 7.8958 NaN S
879 880 1 1 Potter, Mrs. Thomas Jr (Lily Alexenia Wilson) female 56.0 0 1 11767 83.1583 C50 C
880 881 1 2 Shelley, Mrs. William (Imanita Parrish Hall) female 25.0 0 1 230433 26.0000 NaN S
881 882 0 3 Markun, Mr. Johann male 33.0 0 0 349257 7.8958 NaN S
882 883 0 3 Dahlberg, Miss. Gerda Ulrika female 22.0 0 0 7552 10.5167 NaN S
883 884 0 2 Banfield, Mr. Frederick James male 28.0 0 0 C.A./SOTON 34068 10.5000 NaN S
884 885 0 3 Sutehall, Mr. Henry Jr male 25.0 0 0 SOTON/OQ 392076 7.0500 NaN S
885 886 0 3 Rice, Mrs. William (Margaret Norton) female 39.0 0 5 382652 29.1250 NaN Q
886 887 0 2 Montvila, Rev. Juozas male 27.0 0 0 211536 13.0000 NaN S
887 888 1 1 Graham, Miss. Margaret Edith female 19.0 0 0 112053 30.0000 B42 S
888 889 0 3 Johnston, Miss. Catherine Helen "Carrie" female NaN 1 2 W./C. 6607 23.4500 NaN S
889 890 1 1 Behr, Mr. Karl Howell male 26.0 0 0 111369 30.0000 C148 C
890 891 0 3 Dooley, Mr. Patrick male 32.0 0 0 370376 7.7500 NaN Q

891 rows × 12 columns
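
Before deciding which columns to keep, a per-column count of missing values shows where cleaning will be needed. The sketch below uses a three-row stand-in frame (`sample`, an illustrative name) copied from rows 4 to 6 of the table above rather than the full file; on the real data the same call would be `titanic.isnull().sum()`.

```python
import numpy as np
import pandas as pd

# Stand-in for the full dataset: rows 4-6 of the table above (illustrative only)
sample = pd.DataFrame({
    "Age": [35.0, np.nan, 54.0],
    "Cabin": [np.nan, np.nan, "E46"],
    "Embarked": ["S", "Q", "S"],
})

# Number of missing entries in each column
missing_per_column = sample.isnull().sum()
print(missing_per_column)
```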

Passenger ID, Name, Ticket, Cabin and Embarked add little to this analysis, so we drop these columns from the dataset:


In [6]:
titanic = titanic.drop(['PassengerId','Name','Ticket', 'Cabin', 'Embarked'], axis=1)

In [7]:
titanic['Survived'].describe()


Out[7]:
count    891.000000
mean       0.383838
std        0.486592
min        0.000000
25%        0.000000
50%        0.000000
75%        1.000000
max        1.000000
Name: Survived, dtype: float64

Data cleaning:

The Age column has a lot of missing values: only 714 of the 891 rows have an age. We fill the gaps with random integers drawn from within one standard deviation of the mean age.


In [8]:
titanic['Age'].describe()


Out[8]:
count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [9]:
average_age = titanic["Age"].mean()
std_age = titanic["Age"].std()
count_nan_age = titanic["Age"].isnull().sum()
# generate random integers in [mean - std, mean + std); randint excludes the upper bound
rand = np.random.randint(int(average_age - std_age), int(average_age + std_age), size = count_nan_age)

In [10]:
# Fill missing ages with the random values generated above
# (.loc assigns on the original frame and avoids the chained-assignment warning)
titanic.loc[titanic['Age'].isnull(), 'Age'] = rand

In [11]:
titanic['Age'].describe()


Out[11]:
count    891.000000
mean      29.538911
std       13.523755
min        0.420000
25%       21.000000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64
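
Because the fill values are random, the summary statistics for Age change on every run. A minimal sketch of a reproducible variant follows, assuming a fixed seed (the value 0 is arbitrary) and using `.loc` so that the assignment targets the frame itself; `demo` is a small illustrative frame, not the Titanic data.

```python
import numpy as np
import pandas as pd

np.random.seed(0)  # arbitrary seed, assumed here only for repeatability

# Small illustrative frame with two missing ages
demo = pd.DataFrame({"Age": [22.0, 38.0, np.nan, 35.0, np.nan, 54.0]})

mean_age = demo["Age"].mean()
std_age = demo["Age"].std()
missing = demo["Age"].isnull()

# Random integers in [mean - std, mean + std); randint excludes the upper bound
low, high = int(mean_age - std_age), int(mean_age + std_age)
fill = np.random.randint(low, high, size=missing.sum())

# .loc writes into demo itself, which avoids pandas' chained-assignment warning
demo.loc[missing, "Age"] = fill
print(demo)
```

Fixing the seed trades randomness for repeatability, which makes the later counts per passenger type stable between runs.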

In [12]:
sns.distplot(titanic['Age'])
plt.show()


A passenger's family size is the number of siblings/spouses (SibSp) plus the number of parents/children (Parch) on the ship, plus the passenger:


In [13]:
# Family size
titanic['Family_size'] = titanic['SibSp'] + titanic['Parch'] + 1

Next, we extract the survivors into a separate dataset for later analysis:


In [14]:
survived = titanic[titanic['Survived'] == 1]

Questions:

According to Wikipedia, "Women and children first" is a code of conduct dating from 1860, whereby the lives of women and children were to be saved first in a life-threatening situation, typically abandoning ship, when survival resources such as lifeboats were limited. The Wikipedia page gives some insights and statistics on survival rates on the Titanic; in this analysis, I re-check them against the data and try to find out which other factors influenced survival in the Titanic tragedy.

The questions I am going to answer in this analysis are:

  1. Was there really a "Women and children first" rule on the Titanic?
  2. Did other factors such as wealth/classes and family sizes affect someone's chance of survival?

Women and children first?

Assuming that a child's gender made no difference to the rescuers, I split the passengers into 3 types, treating anyone aged 16 or under as a child:


In [15]:
def passenger_type(person):
    if person['Age'] <= 16:
        return "child"
    elif person['Sex'] == "female":
        return "female_adult"
    else:
        return "male_adult"

titanic['Type'] = titanic.apply(passenger_type, axis = 1)
titanic


Out[15]:
Survived Pclass Sex Age SibSp Parch Fare Family_size Type
0 0 3 male 22.0 1 0 7.2500 2 male_adult
1 1 1 female 38.0 1 0 71.2833 2 female_adult
2 1 3 female 26.0 0 0 7.9250 1 female_adult
3 1 1 female 35.0 1 0 53.1000 2 female_adult
4 0 3 male 35.0 0 0 8.0500 1 male_adult
5 0 3 male 16.0 0 0 8.4583 1 child
6 0 1 male 54.0 0 0 51.8625 1 male_adult
7 0 3 male 2.0 3 1 21.0750 5 child
8 1 3 female 27.0 0 2 11.1333 3 female_adult
9 1 2 female 14.0 1 0 30.0708 2 child
10 1 3 female 4.0 1 1 16.7000 3 child
11 1 1 female 58.0 0 0 26.5500 1 female_adult
12 0 3 male 20.0 0 0 8.0500 1 male_adult
13 0 3 male 39.0 1 5 31.2750 7 male_adult
14 0 3 female 14.0 0 0 7.8542 1 child
15 1 2 female 55.0 0 0 16.0000 1 female_adult
16 0 3 male 2.0 4 1 29.1250 6 child
17 1 2 male 18.0 0 0 13.0000 1 male_adult
18 0 3 female 31.0 1 0 18.0000 2 female_adult
19 1 3 female 34.0 0 0 7.2250 1 female_adult
20 0 2 male 35.0 0 0 26.0000 1 male_adult
21 1 2 male 34.0 0 0 13.0000 1 male_adult
22 1 3 female 15.0 0 0 8.0292 1 child
23 1 1 male 28.0 0 0 35.5000 1 male_adult
24 0 3 female 8.0 3 1 21.0750 5 child
25 1 3 female 38.0 1 5 31.3875 7 female_adult
26 0 3 male 24.0 0 0 7.2250 1 male_adult
27 0 1 male 19.0 3 2 263.0000 6 male_adult
28 1 3 female 25.0 0 0 7.8792 1 female_adult
29 0 3 male 31.0 0 0 7.8958 1 male_adult
... ... ... ... ... ... ... ... ... ...
861 0 2 male 21.0 1 0 11.5000 2 male_adult
862 1 1 female 48.0 0 0 25.9292 1 female_adult
863 0 3 female 41.0 8 2 69.5500 11 female_adult
864 0 2 male 24.0 0 0 13.0000 1 male_adult
865 1 2 female 42.0 0 0 13.0000 1 female_adult
866 1 2 female 27.0 1 0 13.8583 2 female_adult
867 0 1 male 31.0 0 0 50.4958 1 male_adult
868 0 3 male 28.0 0 0 9.5000 1 male_adult
869 1 3 male 4.0 1 1 11.1333 3 child
870 0 3 male 26.0 0 0 7.8958 1 male_adult
871 1 1 female 47.0 1 1 52.5542 3 female_adult
872 0 1 male 33.0 0 0 5.0000 1 male_adult
873 0 3 male 47.0 0 0 9.0000 1 male_adult
874 1 2 female 28.0 1 0 24.0000 2 female_adult
875 1 3 female 15.0 0 0 7.2250 1 child
876 0 3 male 20.0 0 0 9.8458 1 male_adult
877 0 3 male 19.0 0 0 7.8958 1 male_adult
878 0 3 male 38.0 0 0 7.8958 1 male_adult
879 1 1 female 56.0 0 1 83.1583 2 female_adult
880 1 2 female 25.0 0 1 26.0000 2 female_adult
881 0 3 male 33.0 0 0 7.8958 1 male_adult
882 0 3 female 22.0 0 0 10.5167 1 female_adult
883 0 2 male 28.0 0 0 10.5000 1 male_adult
884 0 3 male 25.0 0 0 7.0500 1 male_adult
885 0 3 female 39.0 0 5 29.1250 6 female_adult
886 0 2 male 27.0 0 0 13.0000 1 male_adult
887 1 1 female 19.0 0 0 30.0000 1 female_adult
888 0 3 female 19.0 1 2 23.4500 4 female_adult
889 1 1 male 26.0 0 0 30.0000 1 male_adult
890 0 3 male 32.0 0 0 7.7500 1 male_adult

891 rows × 9 columns
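
The row-wise `apply` above is easy to read but calls a Python function once per passenger. A vectorized sketch of the same rule using `np.select` is shown below on a small illustrative frame (`people` is a made-up name). The conditions are checked in order, so the age test takes priority exactly as in `passenger_type`.

```python
import numpy as np
import pandas as pd

# Illustrative passengers, not taken from the dataset
people = pd.DataFrame({
    "Age": [22.0, 38.0, 2.0, 14.0, 16.0],
    "Sex": ["male", "female", "male", "female", "male"],
})

# Conditions are evaluated in order; the first match wins
conditions = [people["Age"] <= 16, people["Sex"] == "female"]
labels = ["child", "female_adult"]

# Anyone matching neither condition falls through to the default
people["Type"] = np.select(conditions, labels, default="male_adult")
print(people)
```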


In [16]:
titanic['Type'].value_counts()


Out[16]:
male_adult      517
female_adult    263
child           111
Name: Type, dtype: int64

In [17]:
sns.set(style="darkgrid")
ax = sns.countplot(x="Type", data = titanic)
plt.show()


We can see that male adults were the largest group on the ship, followed by female adults and children.

Now looking into the survival rate:


In [18]:
survived = titanic[titanic['Survived'] == 1]
non_survived = titanic[titanic['Survived'] == 0]

In [19]:
survived['Type'].value_counts()


Out[19]:
female_adult    199
male_adult       87
child            56
Name: Type, dtype: int64

In [20]:
non_survived['Type'].value_counts()


Out[20]:
male_adult      430
female_adult     64
child            55
Name: Type, dtype: int64

In [21]:
sns.set(style="darkgrid")
ax = sns.countplot(x="Survived", hue = "Type", data = titanic)
plt.show()


Compared with the initial number of people of each type, children had a survival rate of just over 50%, female adults an impressive rate of around 75%, and male adults a low rate of around 17%. This is consistent with a "women and children first" code being followed when it came to saving people on the ship.
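
The survival rates per passenger type can be computed rather than read off the chart. The sketch below rebuilds them from the counts printed in Out[16] and Out[19]; `counts` is an illustrative name.

```python
import pandas as pd

counts = pd.DataFrame(
    {
        "total": [517, 263, 111],     # passengers per type, from Out[16]
        "survived": [87, 199, 56],    # survivors per type, from Out[19]
    },
    index=["male_adult", "female_adult", "child"],
)

# Survival rate = survivors / passengers of that type
counts["rate"] = counts["survived"] / counts["total"]
print(counts["rate"].round(3))
```

On the full frame, `titanic.groupby('Type')['Survived'].mean()` would give the same rates directly, since Survived is coded as 0 or 1.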


In [22]:
sns.distplot(survived['Age'])
plt.show()


The age distribution of the survivors is also consistent with younger passengers having had a survival advantage over older ones, although it should be read against the age distribution of all passengers plotted earlier.

Socio-economic classes:

We can assume that someone's class on the Titanic represented their socio-economic status. We also assume that fares are directly correlated with class, so we only need to examine one of the two.
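
The assumption that fare tracks class can be checked before setting one of the two aside. As a rough sketch, the block below uses only the first ten rows printed in Out[5], so the numbers are illustrative and not the full-data result; `first_rows` is a made-up name.

```python
import pandas as pd

# Pclass and Fare for rows 0-9 of Out[5] (a small illustrative subset)
first_rows = pd.DataFrame({
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
    "Fare": [7.25, 71.2833, 7.925, 53.1, 8.05, 8.4583, 51.8625, 21.075, 11.1333, 30.0708],
})

# Median fare paid in each class
median_fare = first_rows.groupby("Pclass")["Fare"].median()
print(median_fare)
```

On the full frame the same check would be `titanic.groupby('Pclass')['Fare'].median()`.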


In [23]:
titanic['Pclass'].value_counts()


Out[23]:
3    491
1    216
2    184
Name: Pclass, dtype: int64

In [24]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", data = titanic)
plt.show()


Approximately 55% of the passengers traveled in third class, while the rest traveled in first and second class. Next, we check whether the premium paid by first- and second-class passengers also bought them a better chance of survival.


In [25]:
survived['Pclass'].value_counts()


Out[25]:
1    136
3    119
2     87
Name: Pclass, dtype: int64

In [26]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", hue = "Survived", data = titanic)
plt.show()


The survival rate of first-class passengers was more than 60% (136 of 216), while that of third-class passengers was only around 25% (119 of 491). So wealth and socio-economic status appear to have mattered, even in a life-threatening situation.

Now, if we factor in both passenger class and passenger type (male adult, female adult or child), which carries more weight in the survival rate?


In [27]:
titanic.groupby(['Pclass', 'Type']).Type.count()


Out[27]:
Pclass  Type        
1       child            12
        female_adult     88
        male_adult      116
2       child            21
        female_adult     66
        male_adult       97
3       child            78
        female_adult    109
        male_adult      304
Name: Type, dtype: int64

In [28]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", hue = "Type", data = titanic)
plt.show()



In [29]:
titanic.groupby(['Pclass', 'Type']).agg({'Survived': 'sum'})


Out[29]:
                     Survived
Pclass Type
1      child                8
       female_adult        86
       male_adult          42
2      child               19
       female_adult        60
       male_adult           8
3      child               29
       female_adult        53
       male_adult          37

In [30]:
sns.set(style="darkgrid")
ax = sns.countplot(x = "Pclass", hue = "Type", data = survived)
plt.show()


We can see that the women of the first and second classes had very high survival rates (about 98% and 91% respectively), as did the children of the second class (about 90%); first-class children survived at a rate of about 67%, although there were only 12 of them. The women and children of the third class had much lower survival rates (about 49% and 37% respectively). However, the women and children from the third class still had a higher survival rate than the men from the higher classes. Men from the first class had a survival rate of around 36%, which was actually below the overall survival rate of 38.38%. Men from the second and third classes suffered very low survival rates, around 8% and around 12% respectively.
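
Per-group survival rates can be obtained by dividing the survivor counts in Out[29] by the passenger counts in Out[27]. A sketch of that calculation, with the counts typed in by hand from those two outputs (the names `total`, `survivors` and `rate` are illustrative):

```python
import pandas as pd

index = pd.MultiIndex.from_product(
    [[1, 2, 3], ["child", "female_adult", "male_adult"]],
    names=["Pclass", "Type"],
)
# Passengers per (class, type), from Out[27]
total = pd.Series([12, 88, 116, 21, 66, 97, 78, 109, 304], index=index)
# Survivors per (class, type), from Out[29]
survivors = pd.Series([8, 86, 42, 19, 60, 8, 29, 53, 37], index=index)

# Survival rate per group, laid out as a class-by-type table
rate = (survivors / total).unstack("Type").round(3)
print(rate)
```

On the full frame, `titanic.groupby(['Pclass', 'Type'])['Survived'].mean().unstack()` would produce the same table without retyping the counts.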

Family size:

Did people have a higher chance of survival if they traveled with family rather than traveling alone? We'll find out.


In [31]:
titanic['Family_size'].value_counts()


Out[31]:
1     537
2     161
3     102
4      29
6      22
5      15
7      12
11      7
8       6
Name: Family_size, dtype: int64

We can see that most passengers (537 of 891, about 60%) traveled by themselves, followed by families of 2 or 3. Families of more than 3 members made up a small part of the ship. Now we look into the survival statistics:


In [32]:
survived['Family_size'].value_counts()


Out[32]:
1    163
2     89
3     59
4     21
7      4
6      3
5      3
Name: Family_size, dtype: int64

In [33]:
sns.boxplot(x="Survived", y="Family_size", data=titanic)
plt.show()



In [34]:
sns.kdeplot(survived['Family_size'], shade=True)
plt.show()


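Both the boxplot and the distribution curve show that survivors came mostly from small families: passengers with a family size under 4 account for 311 of the 342 survivors in Out[32], about 91%. Nobody from the families of 8 or 11 appears among the survivors, so big families seem to have been penalized harshly on survival rate.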
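
A group's share of the survivors depends on how many passengers were in that group to begin with, so a survival rate per family size is a useful complement. A sketch from the counts in Out[31] and Out[32] follows (the names are illustrative; sizes 8 and 11 do not appear in Out[32] and are therefore filled with 0):

```python
import pandas as pd

# Passengers per family size (Out[31]) and survivors per family size (Out[32])
total = pd.Series({1: 537, 2: 161, 3: 102, 4: 29, 5: 15, 6: 22, 7: 12, 8: 6, 11: 7})
survivors = pd.Series({1: 163, 2: 89, 3: 59, 4: 21, 5: 3, 6: 3, 7: 4})

# Sizes absent from Out[32] had no survivors, so fill them with 0
survivors = survivors.reindex(total.index, fill_value=0)

# Survival rate for each family size
rate = (survivors / total).round(3)
print(rate)
```

On the full frame, `titanic.groupby('Family_size')['Survived'].mean()` would give the same rates.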

Conclusion:

In this analysis, we can see a clear "women and children first" pattern in who was rescued from the Titanic. The data also suggest an effect of socio-economic class and family size on someone's chance of survival, although these had less impact than the "women and children first" rule. Also, women and children from the lower classes still had a better chance of survival than men from the higher classes.

The analysis has a few limitations. Many Age values were missing (177 of 891), and filling them with random numbers introduces a margin of error; for example, a randomly assigned age decides whether such a passenger counts as a child or an adult. With more knowledge of how to make use of variables such as Name, Embarked or Cabin, the analysis could also be improved.